Facticity as the amount of self-descriptive
information in a data set

Abstract: Using the theory of Kolmogorov complexity the
notion of facticity $\varphi(x)$ of a string is defined as the amount of
self-descriptive information it contains. It is proved that (under
reasonable assumptions: the existence of an empty machine
and the availability of a faithful index) facticity is definite, i.e.
random strings have facticity 0 and for compressible strings
$0 < \varphi(x) < \frac{1}{2}|x| + O(1)$. Consequently facticity measures the
tension in a data set between structural and ad-hoc information
objectively. For binary strings there is a so-called facticity
threshold that is dependent on their entropy. Strings with facticity
above this threshold have no optimal stochastic model and
are essentially computational. The shape of the facticity versus
entropy plot coincides with the well-known sawtooth curves
observed in complex systems. The notion of factic processes
is discussed. This approach overcomes problems with earlier
proposals to use two-part code to define the meaningfulness or
usefulness of a data set.

Index Terms: facticity, useful information, Kolmogorov complexity,
two-part code optimization, nickname problem



I. Introduction 

ALL known formal measures of information (Shannon
[11], Kolmogorov [23], Fisher [15]) assign the highest
information value to data sets with maximum entropy. This
implies that a television broadcast with pure noise is the most
information-rich program we can watch. This obviously does
not match our intuitions about what useful information is. In
the past decades there have been a number of competing
proposals to define a formal unit of measurement of meaningful
or useful information.

• Esthetic Measure (Birkhoff, Bense)

• Sophistication (Koppel, [20], [4])

• Logical Depth (Bennett, [6])

• Statistical complexity (Crutchfield, Young, [12], [?])

• Effective complexity (Gell-Mann and Lloyd, [17])

• Meaningful Information (Vitanyi, [29])

• Self-dissimilarity (Wolpert and Macready, [31])

• Computational Depth (Antunes et al., [5])

Three intuitions dominate the research. A string is
'interesting' when:

• A certain amount of computation is involved in its
creation (Sophistication, Computational Depth).

• It has internal phase transitions (self-dissimilarity).

• There is a balance between the model-code and the
data-code under two-part code optimization (effective
complexity).

Such models penalize both maximal entropy and low
information content, but the exact relationship between
these intuitions is unclear. Several authors have suggested
using Rissanen's notion of Minimum Description Length (MDL)
([25], [?]) and the theory of Kolmogorov complexity ([?])
as building blocks for a theory of meaningful information
([23], [17]). This idea is already implied in Kolmogorov's
structure function [28]. There are fundamental problems with
this approach ([24]). One quality a theory of facticity ought to
have is that it should give us a guarantee that it is an objective
measure. It should be definite: 1) it should extract all model
information from a data set but 2) not more. To my knowledge,
none of the proposals for meaningful information made so far
in the literature has been proved to be definite.

II. Definitions 
A. Kolmogorov Complexity 

We will follow the standard textbook of Hopcroft, Motwani
and Ullman for the basic definition of a Turing machine
(TM) [19]. $T$ will be the set of all possible descriptions
according to this formalism.

Definition 1 (Self-delimiting Code): Let $x$ be a binary
string. The self-delimiting code for $x$ is defined as the binary
string $\bar{x} = 0^{\log c} 1 c x$, where $c = |x|$ is (a binary
representation of) the length of $x$. We have
$|\bar{x}| = c + 2\log c + 1$. The self-delimiting code for the
empty string is $\bar{\epsilon} = 1$.
An example: the self-delimiting code for "11001110" is the
concatenation of "000", "1", "100" and "11001110". Note that
self-delimiting code in this sense is prefix-free: strings of
different length get a different prefix. Throughout this paper
we will assume a reference universal Turing machine $U$ with
prefix-free indexes has been chosen. If $U(\bar{i}p) = x$, then
$\bar{i}$ is the prefix-free index of a Turing machine $T_i$ to be
emulated by $U$ and $p$ is an input string for $T_i$. Note that the
input string $p$ is not prefix-free, a feature that is essential for
the results in this paper. In short: $U(\bar{i}p) = T_i(p) = x$. All
definitions in this paper refer to the preselected machine $U$.
Without loss of generality we will suppose that there is a
minimal Turing machine $T_0$ that simply is empty and does not
compute anything. The index of $T_0$ is the empty string
$\epsilon$.
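The self-delimiting code of Definition 1 can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the sketch assumes that the run of zeros has exactly the bit length of $c$, so that a decoder can recover the field boundaries. The field widths in the worked example above may follow a slightly different convention.

```python
def encode_self_delimiting(x: str) -> str:
    """Return 0^{|c|} 1 c x, where c is the length of x written in binary.

    Assumption: the number of leading zeros equals the bit length of c.
    The empty string is encoded as "1".
    """
    c_bits = format(len(x), "b") if x else ""
    return "0" * len(c_bits) + "1" + c_bits + x


def decode_self_delimiting(stream: str) -> tuple[str, str]:
    """Split a stream into (x, rest), where the stream starts with the code of x."""
    k = 0
    while stream[k] == "0":          # count the zeros in front of the marker bit
        k += 1
    c_bits = stream[k + 1 : 2 * k + 1]   # the k bits after the marker hold c
    c = int(c_bits, 2) if c_bits else 0
    start = 2 * k + 1
    return stream[start : start + c], stream[start + c :]
```

Because the length field announces where $x$ ends, an unmarked input $p$ can be appended directly after an index, which is the property the concatenation $\bar{i}p$ relies on.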



The term 'facticity' is derived from Heidegger, and denotes the
unexplainable 'givenness' of reality. The roots of this notion are theological:
'factum est': it has been made. Cf. "All things were made by him; and without him
was not any thing made that was made": "Omnia per ipsum facta sunt, et sine
ipso factum est nihil, quod factum est" (Gospel of St. John: 1,1,3). I think
the term is appropriate because the theory of facticity is as close as we will
ever come to a pure mathematical theory of creation and creativity.



Lemma 1 (Swap-machine): There exists a special machine
with index $s$ that simply swaps the index and the input, i.e.
for all $p$, if $U(\bar{i}p) = x$ then:

$$U(\bar{s}pi) = U(\bar{i}p) = x$$

Proof: This follows from the fact that $U$ is universal. □

Definition 2: Let $x$ be a binary string and let $U$ be a
universal Turing machine. The optimal code for $x$ is the
shortest code that generates $x$ on $U$:

$$x^* = \min_i \{ i : U(i) = x \}$$

The length of the optimal code defines the classical
Kolmogorov complexity:

Definition 3: The classical Kolmogorov complexity of a
binary string is: $C(x) = \min\{|i| : U(i) = x\}$

Definition 4: The prefix-free Kolmogorov complexity of a
binary string is: $K(x|y) = \min\{|\bar{i}| : U(\bar{i}y) = x\}$
We define:

Definition 5: $K(x) = K(x|\epsilon)$
This is in fact a one-part code optimization variant $K_1$ of
Kolmogorov complexity that forces all complexity of the
information to be stored in the index of the Turing machine. It
is useful to distinguish a two-part code optimization variant:

Definition 6: $K_2(x) = \min_{i,p}\{|\bar{i}| + |p| : U(\bar{i}p) = x\}$
This version balances the information over an index $i$ and a
program $p$ for $x$. Here $\bar{i}$ is the self-delimiting code of an
index, $U$ is a universal Turing machine that runs program $p$
after interpreting $\bar{i}$, and $\epsilon$ is the empty string. The
reason to use $\bar{i}$ lies in the fact that it allows us to separate
the concatenation $\bar{i}p$ into its constituent parts, $i$ and $p$.
Here $i$ can be seen as capturing the regular (structural [28],
meaningful [29], model [18], effective [17]) part of the string $x$,
where $p$ describes the irregular part.

The following lemmas show that two-part code is really
more expressive than its one-part variants.

Lemma 2: For all $x$ we have $K(x) \ge K_2(x)$.
Proof: Suppose $K_2(x) > K(x)$. We have
$K_2(x) = \min_{i,p}\{|\bar{i}| + |p| : U(\bar{i}p) = x\}$ and
$K(x) = \min_{j}\{|\bar{j}| : U(\bar{j}\epsilon) = x\}$, i.e.
$|\bar{i}| + |p| > |\bar{j}|$. But the pair $(j, \epsilon)$ is itself a
candidate for the two-part optimization, so
$K_2(x) \le \min_{j}\{|\bar{j}| + |\epsilon| : U(\bar{j}\epsilon) = x\} = K(x)$,
a contradiction. □

Lemma 3: For all $x$ we have $C(x) = K_2(x)$.
Proof: We have $K_2(x) = \min_{i,p}\{|\bar{i}| + |p| : U(\bar{i}p) = x\}$
and $C(x) = \min_q\{|q| : U(q) = x\}$. Note that both $C$ and $K_2$ are
defined with respect to the same prefix-free Turing machine
$U$, which implies that $\bar{i}p = q$. □

The elegance of the introduction of an empty machine is
illustrated by:

Lemma 4: For any random string $x$ we have $K_2(x) = |x| + 1$.

Proof: Suppose $x$ is random. In this case it cannot be
compressed and the empty machine $T_0$ with index $\epsilon$ is the
best model of length 0. We have $U(\bar{\epsilon}x) = x$ and thus
$K_2(x) = |\bar{\epsilon}| + |x| = |x| + 1$. □

There exists a so-called universal distribution $m$ with
$m(x) = 2^{-K(x) + O(1)}$. This distribution dominates any
recursive distribution by a multiplicative constant [23].



B. Information theory 

We will follow the standard textbook of Cover and Thomas
for the basic definitions of information theory [11].

Definition 7: A binary source code $C$ for a random variable
$X$ is a mapping from $\mathcal{X}$, the range of $X$, to $\{0,1\}^*$. Let
$C(x)$ denote the codeword corresponding to $x$ and let $l(x)$
denote the length of $C(x)$. The expected length $L(C)$ of a source
code $C(x)$ with probability mass function $p(x)$ is given by

$$L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x)$$

The following lemma ([11], Lemma 5.8.1) is important:

Lemma 5: For any distribution, there exists an optimal
instantaneous code (with minimum expected length) that
satisfies the following properties:

• The lengths are ordered inversely with the probabilities
(i.e. if $p_j > p_k$ then $l_j \le l_k$).

• The two longest codewords have the same length.

• Two of the longest words differ only in the last bit and
correspond to the two least likely symbols.

For a system of messages $S = \{s_i : s_1, s_2, \ldots, s_n\}$, the
Shannon entropy is defined as

$$H(S) = -\sum_{i=1}^{n} p(s_i) \log_2 p(s_i)$$

Here $p(s_i)$ is the probability of message $s_i$. The following
concept is useful:

Definition 8 (Inverse Entropy): The entropy for binary
strings based on a system of messages $S$ with probability $p$ is
$H(p) = -p\log p - (1-p)\log(1-p)$. The inverse entropy on
the interval $s \in [1/2, 1]$ can be estimated using
1-1/ ^ 1 — s 

^ ^^'^~W{^~ W{l-s) 

Here $W(x)$ is the product log function. Because of the
symmetry of the entropy versus probability plot, we can use the
value $p = H^{-1}(s) = 1 - H'^{-1}(s)$ that is defined on the
interval $[0, 1/2]$ to find the probability $p$ associated with a
certain entropy.

Definition 9: A stochastic binary string is a binary string
generated by a system of messages $S = \{0, 1\}$ with a certain
entropy $H(S) \le 1$.

Stochastic binary strings define the connection between
Shannon information and Kolmogorov complexity:

Lemma 6: For a stochastic binary string $x$ of length $k$ we
have in the limit $K(x) = H(x) = kH(S)$.
Proof: We use a result from [11] (Theorem 5.4.2). Since
$S = \{0, 1\}$ is a stationary stochastic source the expected
code length per symbol is $H(S)$, which gives $kH(S)$ as
optimal compression length. Since $x$ is stochastic we have
$H(x) + |\bar{i}| = K(x)$, where $i$ is the index of a program that
prints $x$ on the basis of an optimal code. Since $|\bar{i}|$ is constant,
in the limit we have $H(x) = K(x)$. □

A useful lemma is:

Lemma 7: Almost all compressible binary strings are
stochastic.

Proof: Consider the set of binary strings $X^n_k$ of length $n$ with
$k$ zeros. These strings can be enumerated using an index of
length $\log_2\binom{n}{k}$. A vanishing fraction of these indexes is
itself compressible. A program $p$ of constant length transforms
the indexes into the original strings, giving
$K(x) \le \log_2\binom{n}{k} + |p|$ for elements of $X^n_k$. The
majority of strings in $X^n_k$ will be typical:
$K(x) = \log_2\binom{n}{k} + |p|$. In the limit the contribution
of $|p|$ vanishes. □
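The index length used in the proof of Lemma 7 can be compared numerically with the entropy estimate of Lemma 6. The following sketch (function names are illustrative) computes $\log_2\binom{n}{k}$ exactly and sets it against $nH(k/n)$, which bounds it from above.

```python
import math


def index_length_bits(n: int, k: int) -> float:
    """log2 of the number of binary strings of length n with exactly k zeros."""
    return math.log2(math.comb(n, k))


def entropy_length_bits(n: int, k: int) -> float:
    """n * H(k/n): the Shannon estimate for a source with zero-density k/n."""
    p = k / n
    if p in (0.0, 1.0):
        return 0.0
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))


if __name__ == "__main__":
    for n in (16, 256, 4096):
        k = n // 4
        ratio = index_length_bits(n, k) / entropy_length_bits(n, k)
        print(n, round(ratio, 4))   # the ratio moves towards 1 as n grows
```

The gap between the two quantities grows only logarithmically in $n$, which is why it disappears in the limit statements above.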

These limit results are rather rough and can be refined 
once we have formulated a definite measure for the model 
information in a string. 

III. Facticity 

A problem that has hampered all proposals to define
meaningful information in terms of two-part codes is the fact that
recursive indexes for Turing machines never reflect the true
model information. This is known as the nickname problem
([17], [16], [24]). The following lemmas show that we can
in almost all cases assume that the index function we use is
faithful.

Definition 10 (Faithful index function): Let $I$ be an index
function that gives an index for all elements of a set of
descriptions of all Turing machines $T$ according to some
formalism (say the one introduced in Hopcroft and Ullman).
$I$ is faithful to $T$ iff for all $t \in T$:
$C(t) \le |I(t)| \le C(t) + O(1)$, i.e. the length of the index reflects
the Kolmogorov complexity of the machine within a constant.

Lemma 8 (Existence of faithful index): For every universal
Turing machine $U$ a faithful index set $I$ exists, but it is not
recursive.

Proof: Take $I'$ to be the set of shortest programs for elements
of $T$ that generate descriptions (i.e. $U(i') = t$) and let $q$
be a program that interprets elements $qi \in I$ such that
$U(qip) = T_i(p)$, i.e. $q$ reads in $i$, converts it to a description
$t$ and emulates $t$ on $U$ in order to process $p$. For all elements
$qi \in I$ we have $C(t) \le |qi| \le C(t) + O(1)$. Since $C(t)$ is
not computable such an index can never be recursive. □

In the following paragraphs we will assume that the index
functions are faithful. The following lemma shows that this is
in almost all cases a good assumption:

Lemma 9 (Recursive definitions are almost always faithful):
Let $q$ be an optimal program that generates a unique string $x_n$
on $U$ for each natural number $n$. Then for almost all $n$ the
program $qx_n$ is a faithful index for $U(qx_n)$.
Proof: The faithfulness condition is
$C(U(qx_n)) \le |qx_n| \le C(U(qx_n)) + O(1)$. The first inequality
holds by definition. Suppose that $|qx_n| > C(U(qx_n)) + O(1)$.
In this case, since $q$ is optimal, $U(qx_n)$ is compressible below
$|qx_n| - O(1)$. Since the density of compressible strings in the
limit is zero, this event is extremely rare. □

We cannot design recursive index functions that are
systematically non-faithful. The facticity of a string $x$ is defined
as:

Definition 11 (Facticity):

$$\varphi_U(x) = \min\{|i| : \exists(p)(|\bar{i}p| = K_2(x) \,\&\, U(\bar{i}p) = x)\}$$

i.e. the length of the shortest model code of all optimal
models under two-part code optimization. Note that the
additional code length necessary to make the model prefix-free
is not taken into account in the definition of facticity. This
seems reasonable since this code is not part of the content
of the model per se. Intuitively $\varphi(x)$ is a measure of how
'interesting' or 'useful' the string $x$ is. Note that there might be
different models that produce the same facticity. We generally
feel that random data sets do not contain much meaningful
information. This behavior of facticity is illustrated by the
following lemma:
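Definition 11 cannot be evaluated for a real universal machine, since $K_2$ is not computable. The mechanics of the two-part minimization can still be illustrated on a toy model class. Everything in the sketch below is a hypothetical stand-in: three hand-picked "machines" (identity, doubling, all-zeros) with made-up indexes, and a self-delimiting code whose zero run has the bit length of the length field.

```python
def self_delimiting(s: str) -> str:
    """0^{|c|} 1 c s with c the binary length of s; the empty string maps to "1"."""
    c_bits = format(len(s), "b") if s else ""
    return "0" * len(c_bits) + "1" + c_bits + s


def _invert_identity(x):
    return x                                   # empty machine: output = input


def _invert_double(x):
    half, rem = divmod(len(x), 2)              # machine outputs p + p
    return x[:half] if rem == 0 and half > 0 and x[:half] == x[half:] else None


def _invert_zeros(x):
    # machine reads a binary number n and prints n zeros
    return format(len(x), "b") if x and set(x) == {"0"} else None


# Hypothetical toy indexes; "" plays the role of the empty machine.
TOY_MACHINES = {"": _invert_identity, "0": _invert_double, "1": _invert_zeros}


def toy_two_part(x: str):
    """Return (toy K2, toy facticity, index) over the toy model class."""
    best = None
    for index, invert in TOY_MACHINES.items():
        p = invert(x)
        if p is None:
            continue                           # this machine cannot produce x
        total = len(self_delimiting(index)) + len(p)
        candidate = (total, len(index), index) # ties go to the shorter index
        if best is None or candidate < best:
            best = candidate
    return best
```

For the string "01100110" the doubling machine wins with a total of 8 bits against 9 bits for the empty machine, so the toy facticity is 1; for a string without such structure the empty machine wins and the toy facticity is 0.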

Lemma 10: For any random string $x$ we have $\varphi(x) = 0$.
Proof: This follows from Lemma 4. Suppose $x$ is random. We
have $U(\bar{\epsilon}x) = x$ and thus $\varphi(x) = |\epsilon| = 0$. □

Note that the reverse selection $p = \epsilon$ and $i = x$ would
penalize the total code length with a factor $O(\log x)$, so this
choice is never made. This lemma shows that, in terms of
facticity, random strings are not meaningful. The reverse of
this lemma is:

Lemma 11: All compressible strings have a non-empty
model: if $K_2(x) < |x|$ then $\varphi(x) > 0$.

Proof: Suppose $x$ is compressible and that $\varphi(x) = 0$. We
have $K_2(x) = \min_{p}\{|\bar{\epsilon}| + |p| : U(\bar{\epsilon}p) = x\}$.
Since $p$ is the shortest possible code it is random. Moreover,
after processing the first part of the input, $\bar{\epsilon}$, the machine
$U$ will not start a new computation. Consequently $x = p$. This
contradicts the fact that $x$ is compressible. □

On the other hand the maximal amount of meaningful
information in a string is limited to half its length plus a
constant:

Lemma 12: For any string $x$ we have
$\varphi(x) < \frac{1}{2}K_2(x) + |s|$, where $s$ is the swap-machine.

Proof: $K_2(x) = \min_{i,p}\{|\bar{i}| + |p| : U(\bar{i}p) = x\}$. Suppose
$\varphi(x) > \frac{1}{2}K_2(x) + |s|$. Then there are $p$ and $i$ such that
$U(spi) = U(ip) = x$, but then since $|i| \gg |p|$ we have
$|spi| < |ip| + |s| = K_2(x) + |s|$ and consequently $|spi| <
K_2(x)$, which contradicts the fact that $K_2$ gives the length of
the shortest code. □

We now have to prove that facticity really covers the concept
of the exact amount of meaning in a string. This is summarized
in the main theorem of this paper:

Theorem 1 (Facticity is definite): If a string $x$ is
compressible then $0 < \varphi(x) < \frac{1}{2}|x| + O(1)$. If $x$ is random
then $\varphi(x) = 0$.
Proof: This follows from Lemmas 10, 11 and 12. □

This shows that facticity is actually the result of a balance
between certain tensions in the data set: if the model code
becomes too short, we lose too much of the computational
power of our universal Turing machine; if it becomes too
long, the price for separating the program code from the data
becomes too high. These intuitions seem reasonable. We can
now give the following informal definition:

Definition 12: The facticity $\varphi(x)$ of a string $x$ is the amount
of self-descriptive information $x$ contains.

A. Some results 

In this paragraph I investigate the relation between
complexity $K_2(x)$ and facticity $\varphi(x)$.

Definition 13: Given a string $x$ with prefix-free
Kolmogorov complexity $K_2(x)$ and facticity $\varphi(x)$ I give the
following definitions:

• The randomness deficiency of $x$ is $\delta(x) = |x| - K_2(x)$.

• A string is non-stochastic when $\varphi(x) = K_2(x)$.

• A string is mixed when $0 < \varphi(x) < K_2(x)$.

• The residual entropy of a string is:
$\rho(x) = K_2(x) - (\varphi(x) + 2\log\varphi(x) + 1)$.

The distinction between stochastic and non-stochastic
strings is important. The following theorem describes the
behavior of the stochastic strings.

Theorem 2: Binary stochastic strings have, with probability
$1 - \epsilon$, small optimal models of size $\log|x| + \log k + c$, where

$$\epsilon = (1 - (H^{-1}(s))^k)^{|x|/k}$$

Proof: According to Lemma 5 there is always an optimal
code for which 1) the two longest codewords have the same
length and 2) two of the longest words differ only in the last
bit and correspond to the two least likely symbols. For a
binary stochastic string the probability of the least likely
blocks of length $k$ will be determined by their density of
either ones or zeros. Thus, for a stochastic string that is
long enough, the only relevant model parameters are an
integer of length $\log|x|$ giving the total number of ones
in the string and $k$ giving the size of the maximal block
length. From these two parameters we can estimate the
density. A program of constant length $c$ gives an optimal
code. This approach collapses when, sampling the string with
block size $k$, some random longest-word blocks do not
occur in $x$ and the optimal code has to be adapted. $H^{-1}(s)$
estimates the smallest probability of either zeros or ones in
the string. $(H^{-1}(s))^k$ is the lowest probability for a block
of length $k$. The probability that this block is not selected is
$1 - (H^{-1}(s))^k$. The probability that this block is not selected
in $|x|/k$ trials is $\epsilon = (1 - (H^{-1}(s))^k)^{|x|/k}$. This is the
probability that the string does not have an optimal model in
the sense of Lemma 5. □
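The quantity $\epsilon$ from Theorem 2 is easy to evaluate numerically. The sketch below assumes that the number of trials equals the number of blocks, $|x|/k$, and obtains $H^{-1}(s)$ by bisection rather than by the product-log estimate of Definition 8. The function names are illustrative.

```python
import math


def _binary_entropy(p: float) -> float:
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def _inverse_entropy(s: float, tol: float = 1e-12) -> float:
    """The p in [0, 1/2] with H(p) = s, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if _binary_entropy(mid) < s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def collapse_epsilon(s: float, k: int, length: int) -> float:
    """Probability that the least likely block of size k never occurs.

    Assumption: a string of the given length is sampled as length / k
    independent blocks.
    """
    rare_block = _inverse_entropy(s) ** k      # lowest block probability
    return (1.0 - rare_block) ** (length / k)


if __name__ == "__main__":
    for length in (300, 3000, 30000):
        print(length, collapse_epsilon(0.5, 3, length))
```

For fixed entropy and block size the value falls off exponentially with the length of the string, in line with the remarks that follow.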

Note that in the limit $K_2(x)$ approaches $|x|H(S)$. When
$H(S) = 1 = s$ we have $H^{-1}(s) = 1/2$, i.e. the probability
that strings with high entropy have no small models goes
to zero in the limit. When $H(S) = 1/2 = s$ we have
$H^{-1}(s) \approx 0.1$. The probability that a stochastic string has
a small model increases exponentially with the length of the
string, and for strings of fixed length polynomially with the
complexity of the string. In the limit (long strings with small
block size) the cut-off will be sharp and act like a sort of
percolation threshold. Above a certain complexity the size of
optimal models for random strings will decrease rapidly. A
defining family of curves is what one could call the:

Definition 14 (Collapse probability): For binary stochastic 
strings we have: 

where $k$ is the block size and the length of $x$ is $|x| = k2^k$.
The collapse probabilities, which are instantiations of Theorem 2,
specify the probability that a string of length $k2^k$ with a certain
entropy between zero and one does not have a short optimal
model when sampled with a block size $k$. The definition is
derived from the so-called coupon collector's problem, which
specifies the number of trials a collector of $n$ coupons has
to make to collect all $n$ coupons as $n\log(n)$.
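The coupon collector estimate can be checked against the exact expectation, which is $n$ times the $n$-th harmonic number. The sketch below is a side illustration with hypothetical function names; it compares the exact value with the leading term $n\ln n$ for the case where the coupons are the $2^k$ possible blocks of size $k$.

```python
import math


def expected_trials(n: int) -> float:
    """Exact expected number of draws needed to see all n equally likely coupons."""
    return n * sum(1.0 / i for i in range(1, n + 1))


def leading_term(n: int) -> float:
    """The n ln n approximation of the expectation."""
    return n * math.log(n)


if __name__ == "__main__":
    for k in (4, 8, 12):
        n = 2 ** k                      # number of distinct blocks of size k
        print(k, n, round(expected_trials(n)), round(leading_term(n)))
```

The exact value always exceeds $n\ln n$ by a term linear in $n$, so the approximation is only a leading-order estimate.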

In general one could say that the factors $H(S) = s$ and
$H^{-1}(s)$ specify a balance between entropy and inverse entropy,
that $2^k$ specifies the size of a space and $k$ a sampling granularity.
The collapse probabilities define the facticity curves directly,
as the following theorem shows:

Theorem 3 (Facticity threshold): The optimal model size
for a stochastic binary string $x$ of length $k2^k$ and entropy
$H(S)$ is:

$$\varphi(x) = \log_2\binom{2^k}{2^k\Phi(H(S))} + k + \log k + c$$

Proof: A term of size $\log k$ specifies the string length $k2^k$ and
the block size $k$. A term of length $k$ specifies the density of
zeros. This allows us to estimate $H(S)$. From this we calculate
an optimal coding for a string with $H(S)$. The term $\Phi(H(S))$
gives us the probability that some of the $2^k$ possible blocks
do not occur in $x$, giving $2^k\Phi(H(S))$ as the size of the set of
blocks that do occur in $x$. An optimal index for this set has
complexity

$$\log_2\binom{2^k}{2^k\Phi(H(S))}$$

A program of length $c$ computes an optimal model on the
basis of these data. □

Lemma 13: The maximal facticity of stochastic strings of
length $n$ is $O(n/\log(n))$.

Proof: Direct consequence of Theorem 3. Note that the
maximal complexity of $\varphi(x) = 2^k + k + \log k + c$ is reached
for the value $\Phi(H(S)) = 1/2$. Take $k2^k = n$. □
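The bound in Lemma 13 can be explored numerically. The sketch below rests on an assumption: that the index term in the model size is $\log_2\binom{2^k}{2^k\Phi}$, which is inferred from the proof sketch of Theorem 3 and from the maximum $2^k + k + \log k + c$ quoted in Lemma 13. The function name and the choice $c = 0$ are illustrative.

```python
import math


def model_size_bits(k: int, phi: float, c: float = 0.0) -> float:
    """Assumed model size: index of the occurring blocks + k + log2 k + c.

    phi is the fraction of the 2^k possible blocks taken to occur in the string.
    """
    blocks = 2 ** k
    present = round(blocks * phi)
    index_bits = math.log2(math.comb(blocks, present))
    return index_bits + k + math.log2(k) + c


if __name__ == "__main__":
    k = 10
    n = k * 2 ** k                              # string length k * 2^k
    peak = model_size_bits(k, 0.5)
    print(peak, 2 ** k + k + math.log2(k))      # peak stays just below the bound
    print(peak / (n / math.log2(n)))            # ratio to n / log n stays bounded
```

Under this assumption the curve is symmetric in $\Phi$ and peaks at $\Phi = 1/2$, where the index term approaches $2^k$; with $n = k2^k$ this is of order $n/\log n$.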

Another important insight is that the density of a facticity
plot is completely determined by the sampling probability
distribution. If we use some simple high entropy process for
the generation of examples all our strings will have small
models and high entropy. The interesting plots are generated
by computational procedures:

Lemma 14: A sample taken under the universal distribution
$m$ from the total set of binary strings of length $k$ will have
uniform distribution over the complexity interval
$[K_2(x) = 0, K_2(x) = k]$.

Proof: Direct consequence of Levin's coding theorem:
$m(x) = 2^{-K(x) + O(1)}$. The exponential decay of $m$ is up to
a constant factor equal to the increase of density of binary
strings with $K(x)$, so the exponential decay in probability is
balanced by an exponential increase in density. □

Note that the universal distribution dominates any recursive
distribution up to a multiplicative constant. Remember that the
structure of a model is
$K_2(x) = (\varphi(x) + 2\log\varphi(x) + 1) + \rho(x)$.
Theorem 3 gives an optimal model size for $x$, but in general
this size will not be reached because of the penalty of size
$2\log\varphi(x) + 1$. The search algorithm behind $K_2$ will settle
for an optimal exchange between bits stored in the residual
entropy and the facticity. Purely non-stochastic strings are
found on the line $\varphi(x) = K_2(x)$. Their density diminishes
exponentially with growing $K_2(x)$. Antunes and Fortnow have
proved the existence of so-called absolutely non-stochastic
strings that encode the halting set for all smaller strings [4].
So, close to the upper bound $\varphi(x) = |x|/2$ there are still
non-stochastic strings, although we will with very high probability
never sample them. Note that we can transform every
non-stochastic string to a mixed string by simply flipping some
bits and adding a list of locations of the flipped bits. This
representation is very inefficient. There will be a band of
mixed optimal models close to the non-stochastic models.

Fig. 1. Facticity density in the complexity space $[0, K_2(x) = |x|]$. The
density estimates are compatible with the so-called saw-tooth curves observed
in various data sets.

We can give a taxonomy of strings under $K_2$ compression
(see Figure 1):

1) Strings below the facticity threshold (cf. Theorem 3) are
either:

a) Non-stochastic, i.e. the facticity is 'close' to the
Kolmogorov complexity: $\varphi(x) \approx K_2(x)$

b) Purely stochastic, i.e. the density of the strings is a
sufficient model (see Theorem 2):
$\varphi(x) = \log|x| + \log k + c$

c) Stochastic, i.e. they are under-sampled in the sense
of Theorem 3

2) Strings above the stochasticity threshold (cf. Theorem 3)
are in principle non-stochastic. They are either:

a) Absolutely non-stochastic, i.e. non-computable (cf.
[4]): $\varphi(x) \approx |x|/2$.

b) Computable, i.e. created by a non-deterministic
process: $\varphi(x) \ll |x|/2$.

The following lemma is interesting: 

Lemma 15 (Existence of a saturation point): For any
non-stochastic string $x$ with $\varphi(x) = K_2(x)$ we have

$$\varphi(x) < 2^{|u| + 2\log|u| + 1}$$

where $u$ is the optimal index of the smallest universal Turing
machine $U$.

Proof: $x$ is non-stochastic: $\varphi(x) = K_2(x)$, there are no
data. Note that $|u| + 2\log|u| + 1$ is the length of the
self-delimiting code for $U$. As soon as the model becomes longer
than $\varphi(x) = 2^{|u| + 2\log|u| + 1}$ it is more efficient to prefix $U$
to the code as a general computational model and interpret the
model as data. □

Fig. 2. So-called 'edge of chaos' phenomena in different domains (with
thanks to E. Schultes and the Atlas of Complexity project). On the right the
same plots re-scaled to the unit square. The analysis in this paper suggests that
density of the plots is more relevant than contour lines.



This proof is related to the proof of Lemma 12. An
immediate consequence of this is that the maximal number of
meaningful models that is available for pure computational
structures is limited by:

nl + it + log u + loglog u + log log log u+ 

where $u$ is the length of the optimal index of the smallest
universal Turing machine that exists and
$u + \log u + \log\log u + \log\log\log u + \ldots$ is the theoretical
optimal length of its prefix-free index. Longer models will
automatically be reinterpreted as data without structure.
Non-stochastic objects exist only up to a certain limit, but there
are mixed strings with longer model information.

B. Approximating facticity 

For approximation of facticity we have theoretically the
same limits as were proved in [2]: i.e. the facticity is in
principle non-approximable in finite time. We may compress
the data set, but there is no guarantee that the randomness
deficiency of the model is also improved. In practice, however,
any learning or compression algorithm that allows us
to estimate the balance between the data and the model
code with reasonable accuracy can be used on real world
data. Such algorithms include decision tree induction, nearest
neighbor search, neural networks, grammar induction
algorithms, standard compression algorithms etc. ([?], [31], [10]).
The philosophical question why learning algorithms that in
principle are faulty work reasonably well on real life data is
still a matter of debate ([?], [18]).
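As a rough illustration of the practical route, a standard compressor gives a computable upper bound on the description length of a data set. The sketch below uses zlib as a stand-in compressor; this is an assumption for illustration, not a method taken from the paper, and the result bounds complexity only. It says nothing about how the compressed code splits into model and data, so it does not estimate facticity itself.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding: an upper bound proxy for complexity."""
    return 8 * len(zlib.compress(data, 9))


def deficiency_estimate_bits(data: bytes) -> int:
    """Proxy for the randomness deficiency |x| - K(x); may be negative
    for incompressible input because of the container overhead."""
    return 8 * len(data) - compressed_bits(data)


def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """Reproducible high-entropy test data (illustrative helper)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


if __name__ == "__main__":
    print(deficiency_estimate_bits(b"01" * 2048))               # highly regular
    print(deficiency_estimate_bits(pseudo_random_bytes(4096)))  # close to random
```

A regular input yields a large positive estimate, while high-entropy input yields an estimate near or below zero, matching the intuition that random data leave nothing to compress.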

Sawtooth plots are well-known in the literature and have
been observed empirically in many different environments
(e.g. [?], [?], [?], [?], Figure ??). We can now
tentatively interpret these plots as finding an optimum between
two compression techniques: non-stochastic models and mixed
models. Further research has to show whether these plots can
indeed be explained by the theoretical framework developed
in this paper. The plots have sharp cut-off points in the
area where non-stochastic models collapse into mixed models
and vice versa. The point that is referred to in the literature
as the edge of chaos ([22]) seems to have no special
meaning. It is a collection of ad hoc measurements in an area
where the probability decays exponentially. Close to the point
$\varphi(x) = |x|/2$ there will be so-called absolutely non-stochastic
objects, which exist but are never observed for strings of any
size.



C. Factic processes 

By nature deterministic processes cannot generate new
information. Information is associated with uncertainty, but
in a deterministic process the future is known completely.
On the other hand thermodynamic processes seem to increase
information in an uncontrollable manner. The amount of
information grows but the amount of facticity, or self-descriptive
information, is minimal in the end. This suggests that, apart
from growth of entropy, there is a second useful way to classify
processes in terms of the growth of facticity.

Let $x$ be some system that evolves over time and let $x_t$
be the binary description of the system $x$ at time $t$. In terms of
entropic and factic behavior over time we can distinguish the
following five cases:



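1) Information discarding processes:
$\frac{\Delta K(x_t)}{\Delta t} < 0$, $\frac{\Delta\varphi(x_t)}{\Delta t} < 0$

2) Self-organizing processes:
$\frac{\Delta K(x_t)}{\Delta t} < 0$, $\frac{\Delta\varphi(x_t)}{\Delta t} > 0$

3) Reversible processes:
$\frac{\Delta K(x_t)}{\Delta t} = 0$, $\frac{\Delta\varphi(x_t)}{\Delta t} = 0$

4) Random processes:
$\frac{\Delta K(x_t)}{\Delta t} > 0$, $\frac{\Delta\varphi(x_t)}{\Delta t} \le 0$

5) Factic processes:
$\frac{\Delta K(x_t)}{\Delta t} > 0$, $\frac{\Delta\varphi(x_t)}{\Delta t} > 0$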
Let's discuss these cases briefly:

1) Information discarding processes decrease both entropy
and meaningful information and thus violate the second
law of thermodynamics and do not occur in closed
systems. Standard computing is an example. Recursive
functions are in general information discarding
functions. When a computation is finished we are left
with less entropy and meaningful information than before.
Consider adding two numbers: $a + b = c$. Before the
computation we have $\log a + \log b$ bits of information,
after the computation only $\log(a + b)$.

2) Self-organizing processes still decrease entropy and do
not occur in closed systems. They reduce complexity and
increase facticity. The growth of plants in a greenhouse,
or bacteria on a petri dish are examples.

3) Reversible computation in which we keep track of all
information (including the meaningful information) is a
borderline case. In principle such computations could be
energy neutral in a closed system.

4) Random processes like flipping a coin or diffusion of
gases increase entropy but do not generate any new
facticity. They are studied in the general theories of
randomness and thermodynamics.

5) Factic processes maintain a balance between model and
ad-hoc information, i.e. their data sets can at any time
be interpreted both as stochastic and as non-stochastic.
Processes that increase both entropy and meaningful
information have, to my knowledge, not been studied
well up till now. Still they deserve our attention:
learning, game playing, the development of exchange rates,
evolutionary processes and creative processes are factic.
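The five cases can be written down as a small decision rule on the signs of the two growth rates. The sketch below is illustrative: the function names are hypothetical, the inputs are assumed to be estimates of the change in complexity and in facticity over one time step, and the random case is taken to allow a zero change in facticity. The second function reproduces the bit count for the addition example in case 1.

```python
import math


def classify_process(delta_k: float, delta_phi: float) -> str:
    """Classify one time step by the signs of the complexity and facticity change."""
    if delta_k < 0 and delta_phi < 0:
        return "information discarding"
    if delta_k < 0 and delta_phi > 0:
        return "self-organizing"
    if delta_k == 0 and delta_phi == 0:
        return "reversible"
    if delta_k > 0 and delta_phi <= 0:   # assumption: zero growth counts as random
        return "random"
    if delta_k > 0 and delta_phi > 0:
        return "factic"
    return "unclassified"                # sign combinations outside the five cases


def addition_bits(a: int, b: int) -> tuple[float, float]:
    """Bits needed for the two summands versus bits needed for their sum."""
    return math.log2(a) + math.log2(b), math.log2(a + b)


if __name__ == "__main__":
    print(classify_process(2.0, 0.5))
    print(addition_bits(1000, 24))       # the sum takes fewer bits than the inputs
```

Sign combinations such as falling complexity with constant facticity fall outside the five listed cases and are reported separately.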

It is clear that factic processes in closed finite domains
(e.g. strings of length $n$) cannot increase model information
indefinitely (cf. Lemma 12). The existence of absolutely
non-stochastic uncomputable strings implies that the strings with
the largest models will not be found in finite time. In the
limit factic processes in finite domains will slowly die out and
no increase of model size is possible, no matter how much
computing time is spent. Factic processes defy some of our
basic philosophical intuitions as the following theorem shows:

Theorem 4: In the limit factic processes in infinite domains
have no stable model.
Proof: Facticity is defined as

$$\varphi_U(x) = \min\{|i| : \exists(p)(|\bar{i}p| = K_2(x) \,\&\, U(\bar{i}p) = x)\}$$

For factic processes we have $\frac{\Delta K(x_t)}{\Delta t} > 0$,
$\frac{\Delta\varphi(x_t)}{\Delta t} > 0$.
Suppose that at time $t$ an optimal description of $x_t$
is $T_{i_t}(p_t) = x_t$ and that some time $t'$ later we have
$T_{i_{t'}}(p_{t'}) = x_{t'}$. Because of increasing entropy and facticity
we have $K(i_{t'}) > K(i_t)$ and $K(p_{t'}) > K(p_t)$. For a process
with a sufficient statistic the structural part $K(i)$ will stabilize
after a certain period of time, pushing all the growth of
information to the non-structural part $p$. With factic processes
for which the domain can be expanded indefinitely this will
never happen. □

This insight runs against central intuitions of scientific
methodology. We always feel that no matter how complex
a set of phenomena is, if we gather enough data, in the end
we will be able to come up with a model that explains them.
For factic processes this is by definition not the case. Any
model that explains the data now is sure to fail, at least partly,
somewhere in the future. In a certain way it is impossible to
predict the development of a factic process, but the definition
of facticity itself gives us two rules to guide us when predicting
the evolution of a factic process:

• Randomness aversion: A factic process that appears to
be random will structure itself (the factor $H^{-1}(s)$).

• Model aversion: A factic process is maximally unstable
when it appears to have a regular model (the factor
$H(S) = s$).

Note that this also implies that factic processes are
disruptive: i.e. they are maximally unstable when they appear to
have a model.

IV. Discussion 

For approximation of facticity we have theoretically the 
same limits as were proved in [2]: i.e. the facticity is in 
principle non-approximable in finite time. We may compress 
the data set, but there is no guarantee that the randomness 
deficiency of the model is also improved. In practice, however,
any learning or compression algorithm that allows us
to estimate the balance between the data and the model
code with reasonable accuracy can be used on real-world
data. Such algorithms include decision tree induction, nearest
neighbor search, neural networks, grammar induction algo- 
rithms, standard compression algorithms etc. (IJlJ, 031], llioll ). 
The philosophical question why learning algorithms that in
principle are faulty work reasonably well on real-life data is
still a matter of debate (f?], flS*]).

³ Note that when factic processes reach the limit from theorem [Ts] they can
have no purely non-stochastic models anymore. There will always be some
residual entropy.
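The practical route mentioned above, using an off-the-shelf compressor as a stand-in for the uncomputable quantities, can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `zlib` is chosen arbitrarily as the compressor, the function names are hypothetical, and the result is an upper-bound proxy for the total code length of a data set. It does not separate model code from data code and is therefore not an estimate of facticity itself.

```python
import os
import zlib


def compressed_bits(data: bytes) -> int:
    """Upper-bound proxy for the complexity of data: zlib-compressed size in bits."""
    return 8 * len(zlib.compress(data, 9))


def deficiency_estimate(data: bytes) -> int:
    """Crude estimate of how many bits the compressor saves on data."""
    return 8 * len(data) - compressed_bits(data)


regular = b"01" * 4096       # highly regular input, 8192 bytes
noisy = os.urandom(8192)     # incompressible with overwhelming probability

print(compressed_bits(regular), compressed_bits(noisy))
print(deficiency_estimate(regular), deficiency_estimate(noisy))
```

On the regular input the compressor saves almost everything; on the noisy input it saves nothing and may even add a few bytes of overhead, which is why the estimate can come out slightly negative there.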

We may define a resource-bounded version of facticity:
Definition 15 (t-Facticity): For any time-constructible $t$, the
$t$-time bounded facticity of $x$ is:

$$\varphi^t_U(x) = \min\{|i| : \exists p\,(|ip| = K_2(x) \wedge U(ip) = x \text{ in at most } t(|x|) \text{ steps})\}$$

Variants of Lemmas 10 and 11 hold for any time-constructible
function. Lemma 12 does not hold for small $t(|x|)$, i.e. we may
not have enough computing time to apply the swap function.
In a previous publication {\}^) normalized facticity was 
defined in terms of classical Kolmogorov complexity $C(x)$
and the randomness deficiency of a string: $\delta(x) = |x| - K(x)$

Theorem 1 shows that the current definition of facticity covers
exactly the same intuitions: simple and complex strings are 
penalized and maximal facticity is reached (somewhere) in 
the middle. The present definition lacks the ad-hoc character 
of the previous one and ties the notion of facticity directly to 
the complexity of the optimal model that explains the data. 

Theorem 1 states that $\varphi$ is a good measure for meaningful
information: it gives the optimal separation between the struc- 
tural and ad-hoc part of the data. All structural information 
is absorbed by the index of the computation. The facticity of 
a string is the result of two opposing forces: 1) the prefix- 
free part of the code is penalized by a 2 log |x| + 1 factor and 
should be as short as possible, 2) concatenation of Turing 
machines reduces complexity so we should store as much 
information in the prefix as possible. In this sense a string with 
high normalized facticity has indeed a high tension between 
structural and ad-hoc information. 
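The cost of the prefix-free part can be illustrated with one standard self-delimiting code: the index is preceded by its length in unary, so an index of $n$ bits costs $2n + 1$ bits. The sketch below assumes this particular code for illustration only; the encoding actually used for the index in the definition of facticity may differ, and the function names are hypothetical.

```python
def self_delimiting(bits: str) -> str:
    """Prefix-free code: |bits| ones, a zero, then bits (length 2*|bits| + 1)."""
    return "1" * len(bits) + "0" + bits


def split_index(code: str) -> tuple[str, str]:
    """Split self_delimiting(i) + p back into the index i and the program p."""
    n = code.index("0")  # number of leading ones equals |i|
    return code[n + 1 : 2 * n + 1], code[2 * n + 1 :]


i, p = "1011", "000111"
code = self_delimiting(i) + p
print(code)               # 111101011000111
print(split_index(code))  # ('1011', '000111')
```

Because the boundary between index and program can be recovered without any separator, the two parts can be concatenated into one input string. The price is the doubling of the index length, which is the pressure to keep the prefix short.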

Koppel ([20], [21]) defined the notion of sophistication that,
with hindsight, can be seen as a precursor of facticity:

Definition 16 (Description): A description of a string $x$ is
a pair $(p, d)$ such that $p$ is a self-delimiting total program, and
$x$ is an initial segment of $U(p, d)$. The complexity of $x$ then
is

$$H(x) = \min_{p,d}\{|p| + |d| : (p, d) \text{ is a description of } x\}$$

Definition 17 (c-Sophistication):

$$\mathrm{soph}_c(x) = \min_p\{|p| : \text{there is a } d \text{ such that } (p, d) \text{ is a description of } x \text{ and } |p| + |d| \le H(x) + c\}$$

Here c is a significance level. It is clear that the idea to 
use the prefix code of a two part code optimization as a 
model for a string is already present here. The introduction 
of a significance level makes sophistication less general than 
facticity and there is also no guarantee that sophistication is 
definite in the sense that I have proved above. An idea that is 
closely related is the proposal for coarse sophistication in i^: 

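Definition 18 (coarse Sophistication):

$$\mathrm{csoph}(x) = \min_{p,d}\{2|p| + |d| - C(x) : U(p, d) = x \text{ and } p \text{ is total}\}$$
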
Here the term \p\ might be interpreted as the sophistication 
and \p\ + \d\ — C{x) as a penalty for how far away we are 
from the minimal program. Clearly this measure is rough and 
arbitrary compared to facticity, although the factor 2\p\ might 
be seen as a discount for the program code, much in the same 
spirit as the additional prefix code in the facticity. Many of the 
results for coarse grained sophistication in 112 ill presumably 
can be generalized to hold for facticity, but this is a subject 
for further study. 
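To make Definitions 17 and 18 concrete, the following Python sketch evaluates both measures on a small, entirely hypothetical table of description lengths. The pairs $(|p|, |d|)$ are invented for illustration, the minimum is taken over this finite table rather than over all programs, and $C(x)$ is simply supplied as a number. None of this is computable in the general setting.

```python
def total_complexity(descs):
    """H(x) restricted to the given descriptions: the minimum of |p| + |d|."""
    return min(p + d for p, d in descs)


def sophistication(descs, c):
    """Smallest |p| among descriptions within c bits of the minimum total."""
    h = total_complexity(descs)
    return min(p for p, d in descs if p + d <= h + c)


def coarse_sophistication(descs, c_x):
    """Minimum of 2|p| + |d| - C(x) over the given descriptions."""
    return min(2 * p + d - c_x for p, d in descs)


# Hypothetical (|p|, |d|) pairs for one string x; every p is assumed total.
descs = [(2, 100), (20, 62), (40, 41), (80, 1)]
print(total_complexity(descs))           # 81
print(sophistication(descs, 0))          # 40
print(sophistication(descs, 5))          # 20
print(coarse_sophistication(descs, 81))  # 21 (C(x) assumed to be 81)
```

The jump from 40 to 20 when $c$ goes from 0 to 5 shows how strongly the value of sophistication can depend on the chosen significance level, whereas the coarse variant trades program size against distance from the minimum in a single number.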



In an earlier publication ([17]) Gell-Mann and Lloyd developed
the notion of the effective complexity of a string
in terms of an ensemble, or probability distribution, defined
on all strings. Foley ([16]) explores the consequences of this
approach in terms of Bayesian inference. Both approaches
lead to a view on effective complexity of a string in terms
of a balance between its Kolmogorov complexity and its
Shannon information in the ensemble. The exact tradeoff
between these two notions of information cannot be treated
adequately in this setting. Gell-Mann and Lloyd [17] suggest
$K[E] + H[E] = K[x]$ as additional constraint, Foley [16]
introduces a temperature parameter a that also has a certain
ad hoc character in his proposal for a general prior, based
on a tradeoff between Shannon Information and Kolmogorov
complexity:

The results in this note show that a definition of such a 
notion of effective complexity is not only possible within 
the framework of algorithmic complexity, but also that this 
approach is much more concise and leads to better insights
into the nature of this phenomenon.

The proposal of self-dissimilarity as a measure for complexity
by Wolpert and Macready ([31]) has interesting parallels
with facticity. Simple data sets do not contain enough
information to be self-dissimilar. Random data sets are not
dissimilar enough to be complex in this sense. The exact
relation between self-dissimilarity and facticity is not clear
at this moment and the claim in [17] that self-dissimilarity
and effective complexity are the same seems premature. My
conjecture would be that high facticity is a necessary but not
a sufficient condition for high self-dissimilarity. The reason is
that we can use facticity to formulate a level of interestingness
[...] level of vertical self-dissimilarity such as the proposal in
captures, i.e. the latter description defines a richer notion of
complexity.

Below I discuss some possible objections against the theory: 

• Objection 1) No adequate separation: Vitanyi [29] suggests
a connection with the Minimum Description Length
(MDL) principle ([Q, IH). Let $\mathcal{M}$ be the set of prefix-free
programs. Using Bayes' law, the optimal computational
model under this distribution would be:

$$M_{\mathrm{map}}(x) = \operatorname{argmax}_{M \in \mathcal{M}} \frac{m(M)\,m(x|M)}{m(x)}$$

which can be rewritten as:

$$\operatorname{argmin}_{M \in \mathcal{M}} -\log m(M) - \log m(x|M)$$

Here $-\log m(M)$ can be interpreted as the length of the
optimal model code in Shannon's sense and $-\log m(x|M)$
as the length of the optimal data-to-model code. Using
Levin's coding theorem this can be rewritten as:

$$M_{\mathrm{map}}(x) = \operatorname{argmin}_{M \in \mathcal{M}} K(M) + K(x|M) \qquad (1)$$

This gives optimal two-part code compression of $x$. The
term $K(M)$ in this expression is of special interest because
it seems to capture the notion of facticity introduced
above.



It is pointed out in [24] that, as soon as we try to separate
meaningful information from non-structured information,
it is not clear that we can make an objective choice. This
is really an issue that has to do with the interplay between
one-part and two-part code optimization. We pay a price
for the identification of a model: $K(M) + K(x|M) > K(x)$.
Suppose that we want to model strings in the
domain $\mathcal{P}(\{0,1\}^k)$ (the set of all sets of binary strings
of length $k$). There are two possibilities to formulate an
optimal model for a string $x$.

- Case 1: If the string is random then the optimal
model $M$ would be $\{0,1\}^k$, with $K(M) = O(1)$
and $K(x|M) = |x| + O(1)$, i.e. the index of $x$ in
$\{0,1\}^k$. This gives

$$\varphi_{MDL}(x) = |M_{\mathrm{map}}(x)| = O(1)$$

- Case 2: Equally possible in this context would be a
model $M' = \{x\}$, with $K(M') = |x| + O(1)$ and
$K(x|M') = O(1)$. This gives:

$$\varphi_{MDL}(x) = |M_{\mathrm{map}}(x)| \approx |x|$$

This shows that the standard MDL formulation does not
favor short models. It only favors optimal separation
between structural and non-structural information. The
conclusion is that standard MDL-Kolmogorov theory cannot
be used as a foundation of the theory of facticity,
although in practice it often works quite well. In this
paper I present a version, $\varphi$, that is definite, i.e. it
indeed gives an objective separation between structural and
ad-hoc data.

• Objection 2) Pathological Indexes: This has been called
the 'nickname problem' by Gell-Mann. The universal 
Turing machine that we choose can use pathological 
index functions to select the specific Turing machines. 
Specifically there are choices that give an index of length 
1 to a universal Turing machine U. In this case we get a 
new universal model for one bit. Our MDL code would 
always select this universal model coded in 1 bit and 
continue with standard one-part Kolmogorov code for 
U. Clearly we have to put constraints on the definition 
of the indexes. An ideal index function for measuring 
facticity would have to observe two seemingly conflicting 
conditions: 

- It should reflect the Kolmogorov complexity of the 
definition of the underlying Turing machine and 

- be computable 



I have shown that such an index cannot exist, but that 
'faithfulness' is a reasonable assumption in most cases. 

• Objection 3) The models are not cognitively relevant
([24]). Since Kolmogorov complexity gives the complexity
of individual objects there is no guarantee that
the part of the description we single out as the model, 
captures aspects that from a more cognitively relevant 
point of view would be seen as model information. If 
we develop an optimal two-part code description of an 
individual horse with three legs, the fact that this animal 
has three legs might well end up in the 'model' part of 
the description, although the ideal horse still would have 
four legs. This is true, but when we only have data about 
a horse with three legs, then this is how it should be. 
Facticity captures the notion of an optimal model from 
an algorithmic point of view. Whether these models are 
cognitively relevant is subject for further study, but given 
their generality one may expect this ([30], [9]). 

• Objection 4) Choice for U introduces a bias. This is
true, but the lengths of the theories generated by different
choices of Turing machines will always be at most
a constant apart ([23]). So asymptotically the
different measures are still comparable. Suppose there
exists a universal Turing machine $T_i$, with an index of
length $|i|$, that would generate a considerably smaller code
$p'$ explaining a certain string $x$ than our current choice
$T_j$ with $p$. We simply compress the program by prefixing
$p'$ with the self-delimiting code for the index $i$ and feed
this into $T_j$. The proof of Theorem 1 shows that given
a choice for a reference machine $U$ facticity is defined
with exact precision.

• Objection 5) There are different models with equal 
facticity. There may be several competing models that 
compress the data equally well. This is actually a feature
more than a bug. Gestalt switches are a concrete example 
where two incompatible models give an equally good 
interpretation of the data. So we do not want a theory 
that restricts itself to solutions where only one model is 
the best. 

• Objection 6) No physical substrate. "The effective com- 
plexity of a string as a purely formal construct, lacking 
a physical interpretation, is either close to zero, or equal 
to the string's algorithmic complexity, or arbitrary, de- 
pending on the auxiliary criterion chosen to pick out the 
regular component of the string." [24] This observation is
simply wrong, as the central result in this paper proves. 
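Returning to Objection 1, the tie between the two candidate models for a random string can be made concrete with a toy calculation. The sketch below is illustrative only: the string length `k` and the constant `c` standing in for each $O(1)$ term are arbitrary choices, and the function name is hypothetical.

```python
def two_part_length(model_bits: int, data_bits: int) -> int:
    """Total two-part code length K(M) + K(x|M), both given in bits."""
    return model_bits + data_bits


k = 64  # length of a random string x, in bits (illustrative value)
c = 3   # stand-in for each O(1) term (arbitrary illustrative constant)

# Case 1: M = {0,1}^k; x is coded by its index in M.
case1 = (c, k + c)
# Case 2: M' = {x}; the model spells out x, almost nothing is left to code.
case2 = (k + c, c)

for name, (model_bits, data_bits) in (("case 1", case1), ("case 2", case2)):
    total = two_part_length(model_bits, data_bits)
    print(name, "model:", model_bits, "total:", total)
```

Both cases give the same total, so a criterion that minimizes only the sum cannot choose between a model of size $c$ and one of size $k + c$. This is the ambiguity that a definite measure has to remove.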

V. Conclusion 

In this paper it is shown that it is possible and promising 
to develop a theory of meaningful self-information of strings 
based on Kolmogorov complexity. Such a theory allows us to 
define a concept (of course there are many others possible and 
useful) of meaningful information in terms of facticity. Further 
research will involve an analysis of processes that create data
sets with high facticity (games, genetic algorithms, the stock
market, evolution), analysis of existing data sets in the light
of this theory ([12], [13], [14], [22]) and the development of a
theory of conditional facticity.